The Supercomputer Toolkit and its Applications
1  Introduction 

The Supercomputer Toolkit is a proposed family of standard hardware and
software components from which special-purpose machines can be easily
configured. Using the Toolkit, a scientist or an engineer, starting with a
suitable computational problem, will be able to readily configure a
special-purpose multiprocessor that attains supercomputer-class performance
on that problem, at a fraction of the cost of a general-purpose
supercomputer.

Each type of Toolkit hardware module will be implemented as an individual
board. The boards fit into a common chassis that furnishes only power
and ground. Special cables are used to achieve high-speed communication
among boards and to distribute the clock. A user assembles a machine by
plugging in the required modules and connecting the cables appropriately.
When a particular machine is no longer needed, it can be disassembled,
and its modules can be reassembled into other configurations. As of June
1990, we have designed and fabricated the basic Toolkit processor module,
tailored for high-performance double-precision floating-point operations,
and are beginning to benchmark it. A typical configuration will include
several processor modules. Other hardware modules we hope to develop will
provide for mass memory and high-speed data acquisition.

The intent of this arrangement is to make it simple and relatively
inexpensive to configure special-purpose computational engines. Yet even if
appropriate hardware modules were readily available, these would not be
of much use if programming each new machine entailed a major software-
development effort, or required an intense analysis to exploit the available
parallelism effectively. We believe that, for suitable scientific-computing
applications, one can compile extremely high-performance code from high-level
languages, and moreover, that the compiler can automatically synthesize a
pattern of interconnection well-matched to the program being compiled, as
well as automatically schedule the computation to make effective use of the
available parallelism. In addition to this novel compiler, the software support
for the Toolkit will include an assembler, a simulator, and debugging tools.
There will also be standard software components, such as a scientific library,
for inclusion in Toolkit programs.


We envision that the Toolkit will be used as follows:

One begins with an algorithm that performs the costly inner loop of a
computation that is important enough to warrant constructing a special-
purpose machine. For example, the simulation part of a multidimensional
optimization in the computer-aided design of an analog circuit, or the
integration of the differential equations required to achieve the real-time
control of a nonlinear process, are appropriate for Toolkit implementation.

The Toolkit software will be used to compile the program, targeted for
a number of different Toolkit hardware configurations, some proposed by
the user, others generated automatically by the Toolkit compiler itself. The
compiler will also produce, for each configuration, a simulation that the user
can run on the host machine to help evaluate price-performance tradeoffs.
After a configuration has been selected, the user will obtain the required
modules, wire them together, and connect the machine he has built to a host
computer. The configuration will be verified by means of diagnostics that
are automatically generated and loaded from the host. The target program
will then be loaded, and the new machine will be ready to be used by host
programs as a back-end processor.

2  Historical  Motivation 

The Digital Orrery [2], constructed in 1983-1984, is a special-purpose
numerical engine optimized for high-precision numerical integrations of the
equations of motion of small numbers of gravitationally interacting bodies.
Using 1980 technology, the device is about 1 cubic foot of electronics,
dissipating 150 watts. On the problem it was designed to solve, it was
measured to be 60 times faster than a VAX 11/780 with FPA, or 1/3 the speed
of a Cray 1S.

The Orrery achieves this performance at modest cost for two reasons.
First, its communication paths are specialized for the solar-system problem.
It is organized as a ring of up to ten processing elements, one for each body
to be simulated. The algorithm passes the states of the n bodies around the
ring, allowing the computation of all n^2 accelerations in order n time, with
negligible communication cost. Second, the program that performs the
integration completely exploits the data-independence that is inherent in the
problem. All available cycles are used for floating-point operations; none are
used to support data-structure references.

In 1988, G. Sussman and J. Wisdom used the Orrery to demonstrate that
the long-term motion of the planet Pluto, and by implication the dynamics of
the Solar System, is chaotic [3]. This required integrating the positions of the
outer planets for a simulated time of 845 million years, which required
running the Orrery continuously for more than three months. Before the Orrery,
high-precision integrations over simulated millions of years were prohibitively
expensive, and astrophysicists had done only a few small experiments using
carefully scheduled resources.
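
The ring organization described above for the Orrery can be illustrated
with a small software model. The Python below is not the Orrery's
microcode; it is a minimal sketch that assumes point masses, Newtonian
gravity with G = 1, and hypothetical function names. Each simulated
processing element keeps its own body and receives one visiting state per
step from its ring neighbor, so every pairwise interaction is covered
after n - 1 shifts.

```python
# Illustrative model of the ring scheme: one processing element (PE) per body.
# Assumptions (not from the paper): G = 1, point masses, 3-D tuples, no softening.

def pair_accel(p_i, p_j, m_j):
    """Acceleration of a body at p_i due to a body of mass m_j at p_j."""
    d = [b - a for a, b in zip(p_i, p_j)]
    r2 = sum(c * c for c in d)
    inv_r3 = r2 ** -1.5
    return [m_j * c * inv_r3 for c in d]

def ring_accelerations(pos, mass):
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    # Each PE starts with a copy of its own state as the "visitor".
    visitor = [(pos[i], mass[i]) for i in range(n)]
    for _ in range(n - 1):
        # Shift the visiting states one place around the ring.
        visitor = visitor[-1:] + visitor[:-1]
        for i in range(n):  # on hardware these n updates would run concurrently
            p_j, m_j = visitor[i]
            a = pair_accel(pos[i], p_j, m_j)
            acc[i] = [x + y for x, y in zip(acc[i], a)]
    return acc

def direct_accelerations(pos, mass):
    """Reference serial all-pairs computation, for comparison only."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                a = pair_accel(pos[i], pos[j], mass[j])
                acc[i] = [x + y for x, y in zip(acc[i], a)]
    return acc
```

The outer loop runs n - 1 times while the inner loop stands in for work
done in parallel, which is the sense in which the ring computes all the
accelerations in order n time.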

The objective of our work is to generalize and automate the preparation
of such computing instruments. Starting from a mathematical description of
an application (for example, the equations of motion of the outer planets), a
scientist should be able to use the Toolkit to build a modern version of the
Digital Orrery in about a week of effort, complete with software. With the
same components, and with a similar amount of effort, an engineer should
be able to configure a machine, with software, to optimize the design of a
high-frequency nonlinear circuit such as a phase-locked loop.

3  Applications 

The ability to easily configure special-purpose hardware opens up a variety of
important applications that rely upon the ability to perform high-precision
simulations in real time or faster than real time.

For example, hardware-in-the-loop techniques are used in the development
of mechanical systems: the design of a mechanical assembly may be
simplified by instrumenting already-designed physical parts and coupling
these to actuators driven by simulations of other parts of the assembly.
Usually this is done with analog or hybrid computers, but special-purpose
digital systems configured from general components could be cheaper, more
accurate, and much more flexible.

In the automatic control of highly nonlinear plants, there are techniques
that rely upon being able to simulate the dynamics of the plant faster than
real time, so as to predict the consequences of proposed control actions. Often
it is desirable to operate a plant close to a point of catastrophic failure. The
extent to which such control strategies can be safely implemented depends
upon the quality of the dynamical model of the plant and upon the speed of
computation available to the control engineer. General-purpose computers
with physical characteristics appropriate for use in controllers are inadequate
for this use in all but very slow systems.

Alternatively, consider the situation of an electrical engineer optimizing
the design of an important nonlinear circuit, such as an analog-to-digital
converter. Evaluating each choice of device parameters requires a difficult
simulation that needs many hours of time on a workstation-class computer.
Typically, the engineer will run a simulation overnight and adjust the
parameters after evaluating the result the next day. It is not uncommon for
this work to continue for several months. With the Toolkit, the engineer could
instead invest a week’s effort, analyzing the problem and evaluating alternative
Toolkit configurations, to design and configure a special computer to speed
up the simulation. If each simulation required half a minute rather than 5
hours, the optimization could be performed using automatic algorithms; in a
week of continuous running, a program could achieve a better optimum than
manual methods could ever discover.
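
The arithmetic behind this scenario is easy to check. The sketch below
uses only the two figures quoted above (5 hours and half a minute per
simulation). The manual rate of one overnight run per day is an
assumption added for comparison, and the variable names are illustrative.

```python
# Figures from the text: 5 hours per simulation on a workstation,
# half a minute on a suitably configured Toolkit machine.
workstation_s = 5 * 3600
toolkit_s = 30

speedup = workstation_s / toolkit_s            # ratio of the two run times
runs_per_week = 7 * 24 * 3600 // toolkit_s     # back-to-back runs in one week

# Assumption for comparison: the manual workflow manages one overnight run per day.
manual_runs_per_week = 7

print(f"speedup: {speedup:.0f}x")
print(f"automatic runs per week: {runs_per_week}")
print(f"manual runs per week:    {manual_runs_per_week}")
```

Under these assumptions the speedup is a factor of 600, and a week of
continuous running allows about twenty thousand evaluations, which is
what makes an automatic optimizer practical.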

4  Hardware 

The basic Toolkit processor module contains a few arithmetic execution units,
a small high-speed multiport memory, and a simple controller. In our
prototype, each processor module may connect to other modules via two
bidirectional I/O ports, each of which may connect to other units. (Each module
can connect to about ten others, but we have not yet determined the limits
here.) All the modules and communication paths of a Toolkit configuration
are synchronized by a common clock. One can configure any interprocessor
connection graph, within the fan-out limits, by using a processor for every
branch, where the interconnections are the nodes (see figure 1).
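
One way to read the last sentence is that a configuration is a graph
whose nodes are the shared interconnections and whose branches are the
processors, each processor joining the two interconnections attached to
its two ports. The following Python sketch models that reading. The
names, the per-interconnection interpretation of the fan-out limit, and
the value 10 (taken from "about ten") are assumptions, not measured
hardware limits.

```python
from collections import defaultdict

MAX_FANOUT = 10  # assumption: the text says "about ten"; the limit is undetermined

def channel_members(processors):
    """processors: dict name -> (left_channel, right_channel).
    Returns channel -> sorted list of processors attached to it."""
    members = defaultdict(set)
    for name, (left, right) in processors.items():
        members[left].add(name)
        members[right].add(name)
    return {ch: sorted(ps) for ch, ps in members.items()}

def check_fanout(processors, limit=MAX_FANOUT):
    """True if no interconnection has more than `limit` processors on it."""
    return all(len(ps) <= limit for ps in channel_members(processors).values())

def neighbors(processors, name):
    """Processors reachable from `name` in one hop over either port."""
    members = channel_members(processors)
    out = set()
    for ch in processors[name]:
        out.update(members[ch])
    out.discard(name)
    return sorted(out)

# A 4-processor ring: processor Pi is the branch joining channels ci and c(i+1).
ring = {f"P{i}": (f"c{i}", f"c{(i + 1) % 4}") for i in range(4)}
```

For the ring above, `neighbors(ring, "P0")` lists the two processors that
share an interconnection with `P0`. A cluster would instead attach
several processors to the same channel name.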

In  our  prototype,  each  board  has  a  peak  scalar  floating-point  speed  of  28 
double-precision  Mflops,  and  we  expect  to  be  able  to  sustain  performance 
of  about  half  this  rate  (per  board)  on  real  problems.  The  current  design  is 
constructed  from  off-the-shelf  components,  and  can  be  easily  duplicated  at 
modest  cost. 

Figure  2  shows  the  overall  structure  of  the  processor  module. 

Our goal in designing this board was to use the fastest floating-point
chips available and to provide enough bandwidth to keep them fully utilized.
We chose the two-chip (ALU and multiplier) floating-point chip set made
by Bipolar Integrated Technologies (B.I.T.) and the fastest easily-available
memory (20-ns 16Kx4 SRAM).

Figure 1: Each processor module has two bidirectional I/O ports. The figure shows how
this allows one to build various network architectures: a mesh, a ring and communicating
clusters.

The floating-point unit (FPU) can multiply two 64-bit arguments during
the time it takes to transfer one word from memory. Thus, our desire to
obtain a balanced system, in which the FPU is not starved by the memory,
required that there should be two separate memories. The memories
communicate with the FPU via a 32-entry register array with 5 ports: a read/write
port to each memory, two read ports that supply floating-point arguments,
and a write port for the floating-point result. The register array is
configured from four B.I.T. 5-port 18-bit register-file chips. (This required some
clever design and a delicate clocking scheme.) All of the data paths in our
prototype are byte-parity protected.
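
The register-array width is consistent with byte parity: four 18-bit
chips give 72 bits, which is a 64-bit word plus one parity bit for each of
its eight bytes. That correspondence is an inference from the numbers
above rather than something the text states. The Python sketch below
shows how per-byte parity can be computed and checked. Even parity and
the function names are assumptions, since the text does not give the
polarity.

```python
def byte_parity_bits(word):
    """Return 8 parity bits (bit k covers byte k) for a 64-bit word.
    Assumption: even parity, so each byte plus its parity bit holds an
    even number of ones. The paper does not state the polarity."""
    assert 0 <= word < 1 << 64
    bits = 0
    for k in range(8):
        byte = (word >> (8 * k)) & 0xFF
        bits |= (bin(byte).count("1") & 1) << k
    return bits

def check_word(word, parity):
    """True if the stored parity matches the data (no odd-count error seen)."""
    return byte_parity_bits(word) == parity

REGISTER_WIDTH = 4 * 18            # four 18-bit register-file chips
DATA_PLUS_PARITY = 64 + 8          # 64 data bits plus one parity bit per byte
```

A single flipped bit in any byte changes that byte's parity, so
`check_word` returns False for it; two flipped bits in the same byte
would go undetected, as with any single-bit parity scheme.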

Addresses are supplied to each memory by its own address generator,
which was implemented with a 16-bit wide 2901-style microprocessor.
Control for the entire processor module is expressed with a very long
instruction word (168 bits of horizontal code) that is stored in a 16K deep
microprogram memory. The memory is implemented with the same kind of 16Kx4
SRAMs that we used for the data memories. The microcode memory is
addressed using a 16-bit wide 2910-style microprogram sequencer, which also
provides limited subroutine and branching capabilities.

Figure 2: This is the overall architecture of the prototype processor module, consisting of
a fast floating-point chip set, a 5-port register file, two memories and address generators,
and a sequencer.

We  chose  the  very  long  instruction  word  format  because  during  each  cycle 
(about  70  ns)  an  instruction  needs  to  specify  independent  operations  for  the 
multiplier,  the  ALU,  transfers  among  the  registers,  memories,  and  the  I/O 
ports,  an  instruction  for  each  of  the  address  generators,  and  an  operation  for 
the  sequencer. 
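
The figures quoted for the processor module can be cross-checked with a
little arithmetic. The sketch below assumes that the peak rate counts one
ALU operation plus one multiplier operation per cycle, and that the
microcode store is a single bank of 16Kx4 parts with no spare or parity
bits. Both assumptions are ours, and the constant names are illustrative.

```python
CYCLE_NS = 70            # "about 70 ns" per the text
WORD_BITS = 168          # microinstruction width
DEPTH = 16 * 1024        # 16K-deep microprogram memory
SRAM_BITS_WIDE = 4       # 16Kx4 SRAM parts

# Assumption: peak rate = one ALU op plus one multiplier op every cycle.
flops_per_cycle = 2
peak_mflops = flops_per_cycle / (CYCLE_NS * 1e-9) / 1e6

# Size of the microcode store implied by 168 bits x 16K words.
microcode_bytes = WORD_BITS * DEPTH // 8

# Assumption: one bank of chips, no spares or parity on the microcode store.
sram_chips = WORD_BITS // SRAM_BITS_WIDE

print(f"peak: {peak_mflops:.1f} Mflops")
print(f"microcode store: {microcode_bytes} bytes in {sram_chips} chips")
```

With a 70 ns cycle this gives a peak slightly above 28 Mflops, in line
with the peak rate quoted for the board, and a microcode store of 336 KiB
spread across 42 chips.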

Figure  3  shows  the  layout  of  the  prototype  processor  module  on  a  13”  x  15” 
board,  which  fits  into  a  standard  HP  chassis.  We  expect  to  assemble  a 
machine  with  5  to  10  of  these  boards  during  the  summer  of  1990. 

To build a system with several boards we interconnect their I/O ports
using controlled-impedance transmission lines, terminated at the ends. Each
port can be used to transmit a 64-bit word between processors in two cycles.
As there is no hardware arbitration on the I/O ports, it is necessary that
the programmer develop a convention for controlling access to each
communication channel. To prevent bad programs from burning up the drivers, the
ports are implemented using open-collector TTL transceivers that can drive
impedances as low as 30 Ohms.
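
Because the ports have no hardware arbitration, any access convention has
to be established in software, and with a common clock it can be checked
before the program runs. The Python sketch below shows one hypothetical
convention, a fixed round-robin of two-cycle slots, together with a
checker that reports any two processors scheduled to drive the same
channel at once. The tuple layout and function names are assumptions.

```python
CYCLES_PER_WORD = 2   # from the text: a 64-bit word takes two cycles

def find_conflicts(sends):
    """sends: list of (processor, channel, start_cycle).
    Returns pairs of sends in which two different processors would drive
    the same channel during overlapping two-cycle windows."""
    conflicts = []
    for a in range(len(sends)):
        for b in range(a + 1, len(sends)):
            pa, ca, ta = sends[a]
            pb, cb, tb = sends[b]
            overlap = abs(ta - tb) < CYCLES_PER_WORD
            if ca == cb and pa != pb and overlap:
                conflicts.append((sends[a], sends[b]))
    return conflicts

def round_robin(processors, channel, words_each=1):
    """One possible convention: processors take turns in a fixed order."""
    sends, t = [], 0
    for _ in range(words_each):
        for p in processors:
            sends.append((p, channel, t))
            t += CYCLES_PER_WORD
    return sends
```

A schedule produced by `round_robin` passes the check by construction,
whereas two sends starting one cycle apart on the same channel are
reported as a conflict.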

To avoid reflections the transmission lines are never branched. They
enter the board on one connector, are routed to transceivers, and then exit
the board on another connector. Careful layout minimizes stubs along the
interconnect path. The impedance on the board is the same as the impedance
of the ribbon cable used for interconnect.1

Since each board has two I/O ports, rearranging cables permits one to
statically configure any interconnection scheme (within fanout limits), in
which each processor may communicate with two distinct sets of neighbors.
For example, figure 4 shows how one uses this scheme to configure a
4-processor cluster.

The entire machine is intended to be a back-end computer that
communicates with a host computer via a parallel interface. Communication with the
host is significantly slower than communication between boards. Thus, the
present prototype is best suited for computations where only a small amount
of data is transferred between the Toolkit processors and the host.

1  Henry Wu was instrumental in developing this interconnect technology for our boards.

Figure 3: Layout of the prototype on a 13” x 15” board.


Figure 4: Interconnection between modules is accomplished by transmission lines,
allowing one to statically configure any interconnection network in which each processor is
connected to at most two nodes. The figure shows how to connect cables to create two
communicating 4-processor clusters. The boxes marked “T” are terminators.

5  Software


5.1  Low-Level  Programming  Model 

Each Supercomputer Toolkit processor is programmed as a Very Long
Instruction Word (VLIW) computer. In every cycle, the following operations
can be performed in parallel (see figure 2):

*  Two memory transactions, one to left memory and one to right memory.
Each memory can perform a load or store operation with the register file on
each cycle.

*  Two memory address computations, to generate the addresses that will
be used to access the memories during the following cycle. The address
generators have their own internal register files to support these operations.

*  One program-counter operation: conditional branch, jump, call, push/pop,
etc.

*  One floating-point ALU operation and one floating-point multiply
operation. The ALU and multiplier receive their inputs from, and store their
results into, the main register file.

Both arithmetic chips can be operated simultaneously. However, since
both the ALU and the multiplier share register-file ports, it is necessary that
they do not simultaneously require access to the register file. For example,
while the multiplier is busy doing an operation such as square root, which takes
several cycles to complete, the register-file ports can be used to supply data
to the ALU. Operations such as multiply-accumulate use internal feedback
paths within the arithmetic chips, thereby freeing up register-file ports.

*  Each processor has two I/O ports, each of which is connected to a
communication channel. Two cycles are required to transmit a single 64-bit
word. Accessing an I/O port uses the internal memory bus for one cycle.
Thus, a LEFT I/O operation and a LEFT memory operation cannot both
be performed during the same cycle.
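
The per-cycle resources listed above can be summarized as a record with
one field per functional unit, plus a check for the bus conflict just
described. The Python below is only a sketch: the field names are
hypothetical and do not reflect the real 168-bit microcode layout, the
symmetric RIGHT-side rule is an assumption, and the sharing of
register-file ports between the ALU and the multiplier is not modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Bundle:
    """One VLIW cycle. Field names are hypothetical, not the real microcode format."""
    left_mem: Optional[str] = None    # "load" or "store"
    right_mem: Optional[str] = None
    left_io: Optional[str] = None     # "send" or "recv"
    right_io: Optional[str] = None
    alu: Optional[str] = None
    mul: Optional[str] = None
    sequencer: str = "continue"

def violations(b):
    """List the bus conflicts this bundle would cause."""
    problems = []
    if b.left_mem and b.left_io:
        problems.append("LEFT memory and LEFT I/O both need the left bus")
    # Assumption: the same restriction applies on the right side by symmetry.
    if b.right_mem and b.right_io:
        problems.append("RIGHT memory and RIGHT I/O both need the right bus")
    return problems
```

For example, a bundle that loads from left memory while sending on the
right port is accepted, while one that loads from left memory and sends on
the left port in the same cycle is flagged.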

When  multiple  processors  are  to  be  used  for  a  single  application,  several 
programming  styles  are  possible.  The  simplest  style  is  to  have  the  program 
counters  on  all  of  the  boards  act  in  lock-step,  effectively  forming  a  multiple 
board  VLIW  machine.  An  alternative  is  to  program  the  processors  in  a 
MIMD  style.  In  the  MIMD  style,  the  processors  run  totally  independent 
programs,  exchanging  messages  via  the  communication  channels  as  needed. 

To support more complex programming styles that combine aspects of
both the VLIW and MIMD styles, the hardware provides a wired-or flag for
synchronizing control among multiple boards. For example, in the integration
of differential equations, where different state variables have large variations
in time scale, it is advantageous to use integrators that admit variable and
individual stepsizes. In such systems, some parts of the process can proceed
in VLIW fashion, counting out cycles to maintain synchronization, but other
parts may need explicit synchronization to keep the individual state variable
integrators in step.
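
A wired-or flag lends itself to a simple barrier: each board asserts the
flag while it still has work to do, and all boards resume lock-step
execution on the first cycle in which the shared line reads deasserted.
The Python model below illustrates that idea only. The actual polarity,
timing, and usage convention of the Toolkit's flag are not described in
the text, and the function names are hypothetical.

```python
def wired_or(flags):
    """The shared line reads asserted if any board asserts it."""
    return any(flags)

def run_until_synchronized(work_cycles):
    """work_cycles[i] = cycles board i needs before the sync point.
    Each board asserts the flag while busy; all boards leave the barrier on
    the first cycle in which the shared line reads deasserted.
    Returns (release_cycle, idle_cycles_per_board)."""
    cycle = 0
    while wired_or(remaining > cycle for remaining in work_cycles):
        cycle += 1
    idle = [cycle - w for w in work_cycles]
    return cycle, idle
```

The barrier releases when the slowest board finishes, and the idle counts
show what each faster board gives up by waiting, which is the cost that
pure cycle counting avoids when step sizes are known in advance.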

5.2  Compilation 

We intend to automatically compile and schedule high-performance code for
multiple Toolkit modules and automatically generate an appropriate pattern
of interconnect, but we have not done that yet. Certainly, the task of
programming parallel machines in general is extremely difficult. However, we
believe that there are special characteristics of common numerical methods
that make automatic scheduling and network generation feasible for a large
class of important scientific and engineering applications.

On the other hand, one can make progress using more modest software
support. The Orrery was programmed using a fairly simple symbolic
microcode assembler. This was possible since the solar-system simulation is
not a very complicated program. The partitioning of the problem into
processes, the assignment of these processes to processors, and the programming
of the connections between processors can be derived from knowledge of the
problem.

This kind of low-level programming can be done with the Toolkit now.
However, we have developed a compiler that automates the process of
building Orrery-like programs. A user specifies, in a high-level language, the
straight-line program to be executed in each processor separately. These
fragments can be manually glued together to allow simple communication
patterns and to construct loops.

The compiler, built by Andy Berlin and Bill Rozas, generates efficient
code by using partial evaluation [4, 5] to “flatten” a program. This produces
code that contains extremely long straight-line sequences of numerical
operations (often several thousand operations long). This makes it feasible to
re-order operations to account for pipeline delays, allowing the floating-point
units to be fully utilized. In addition, this allows data motion instructions,
such as memory fetches, to be initiated far in advance of the numerical
operation that needs the data. Work on the Supercomputer Toolkit compiler has
progressed to the point where we can schedule a solar-system program in
such a way as to keep one processor fully utilized. We are now working on
generalizing this approach to schedule code for multiple Toolkit processors.
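
The flattening effect of partial evaluation can be shown with a toy
example. The Python below is not the Toolkit compiler; it is a minimal
tracing sketch, with hypothetical class and function names, in which an
ordinary loop is run on symbolic inputs of known length so that the loop
and the list traversal disappear and only a straight-line list of
arithmetic operations remains.

```python
class Tracer:
    """Records arithmetic on symbolic values as straight-line three-address code."""
    def __init__(self):
        self.ops = []

    def fresh(self, op, a, b):
        name = f"t{len(self.ops)}"
        self.ops.append((name, op, str(a), str(b)))
        return Sym(name, self)

class Sym:
    """A value unknown until run time; arithmetic on it is recorded, not done."""
    def __init__(self, name, tracer):
        self.name, self.tracer = name, tracer

    def __str__(self):
        return self.name

    def __add__(self, other):
        if not isinstance(other, Sym) and other == 0:
            return self  # fold the additive identity while flattening
        return self.tracer.fresh("+", self, other)

    def __mul__(self, other):
        return self.tracer.fresh("*", self, other)

    __radd__ = __add__
    __rmul__ = __mul__

def dot(xs, ys):
    """Ordinary high-level code: a loop over data structures."""
    total = 0.0
    for x, y in zip(xs, ys):
        total = total + x * y
    return total

def flatten_dot(n):
    """Specialize `dot` to vectors of known length n."""
    t = Tracer()
    xs = [Sym(f"x{i}", t) for i in range(n)]
    ys = [Sym(f"y{i}", t) for i in range(n)]
    result = dot(xs, ys)
    return t.ops, str(result)
```

Calling `flatten_dot(3)` yields five operations with no loop control or
indexing left in them. A scheduler is then free to reorder such a list
around pipeline delays, which is the property the paragraph above relies
on.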


5.3  The  Dynamicist’s  Workbench 

Ultimately  we  expect  the  Toolkit  to  be  the  workhorse  for  the  Dynamicist’s 
Workbench,  a  tool  that  will  aid  scientists  and  engineers  in  the  simulation 
and  analysis  of  dynamical  systems.  The  Workbench  includes  a  spectrum 
of  computational  tools — including  numerical  methods  and  symbolic  algebra. 
These  tools  are  designed  so  that  combined  methods,  tailored  to  particular 
problems,  can  be  constructed  on  the  fly. 

For example, one can specify a circuit optimization problem in terms of
the circuit diagram. One can investigate the dynamics of a double pendulum
in terms of a Lagrangian that describes it. The Dynamicist’s Workbench
starts with such descriptions and constructs appropriate numerical
procedures for simulations and optimizations. It automatically prepares
variational equations and sensitivity analysis codes.

Parts  of  the  programs  generated  by  the  Dynamicist’s  Workbench  are 
further  compiled  by  the  Toolkit  compiler  to  make  microcode  for  individual 
Toolkit  boards.  Other  parts  of  the  Workbench  code  will  be  used  to  construct 
host-interface  software  and  analysis  code  to  be  run  in  the  host. 

6  Summary 

The Toolkit project is not meant to address the difficult issues of large-scale
parallel computation. Neither the hardware architecture we propose, nor
the interconnection technology, nor in all likelihood our software ideas can
be expected to scale to systems with many hundreds of processors. Our
goal is to realize means, practical within the limits of current technology, to
provide relatively inexpensive supercomputer performance for a limited, but
important, class of problems in science and engineering. We expect even our
prototype implementation to be useful for problems modeled with systems
of ordinary differential equations. Additional Toolkit modules that we hope
to develop may make other applications feasible, but we have not discussed
applications here that require large memory or for which an appropriate
Toolkit configuration would be bigger than a few boards. Simulation of fluid
flow is one such example. Other examples can be found in [8].

Efforts  with  similar  goals  include  the  NuMesh  effort  at  MIT,  and  the 
iWARP  work  at  CMU  [7]. 

There  are  other  promising  strategies  for  parallel  computation,  represented 
by  machines  such  as  the  MIT  Monsoon  Dataflow  machine,  the  Connection 
Machine,  the  Multiflow  computer,  and  many  others.  These  are  general- 
purpose  machines.  Our  idea  differs  in  that  we  intend  to  statically  configure 
both  hardware  and  software  for  each  particular  problem.  Thus  we  require  no 
general-purpose  software  (such  as  an  operating  system),  no  routing  protocols, 
and  no  hardware  to  support  these  features.  We  believe  that  when  attempting 
to  obtain  maximum  performance  for  a  fixed  level  of  technology,  we  cannot 
afford  to  pay  the  price  of  features  intended  to  support  generality. 

As a result of its high performance, relative ease of programming, and low
cost, we expect the Toolkit to have an impact on scientific and engineering
computation.